
    Sampling Correctors

    In many situations, sample data is obtained from a noisy or imperfect source. To address such corruptions, this paper introduces the concept of a sampling corrector. Such algorithms use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions. These algorithms then act as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms, and we show that these connections can be utilized to expand the applicability of known distribution learning and property testing algorithms as well as to achieve improved algorithms for those tasks. As a first step, we show how to design sampling correctors using proper learning algorithms. We then focus on the question of whether sampling correctors can be more efficient, in terms of sample complexity, than learning algorithms for the analogous families of distributions. When correcting monotonicity, we show that this is indeed the case when we are also granted query access to the cumulative distribution function. We also obtain sampling correctors for monotonicity without this stronger type of access, provided that the distribution is originally very close to monotone (namely, at distance O(1/log^2 n)). In addition, we consider a restricted error model that aims at capturing "missing data" corruptions. In this model, we show that distributions that are close to monotone have sampling correctors that are significantly more efficient than what is achievable by the learning approach. Finally, we consider the question of whether an additional source of independent random bits is required by sampling correctors to implement the correction process.
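
    To make the filtering idea concrete, the sketch below follows the learning-based route mentioned in the abstract on a finite domain {0, ..., n-1}: learn an empirical histogram from the (possibly corrupted) samples, project it onto non-increasing pmfs with a pool-adjacent-violators step, and answer sample requests by drawing from the corrected pmf. The function names, the projection step, and the toy distribution are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def monotone_projection(pmf):
    """Project a pmf onto non-increasing sequences via pool-adjacent-violators,
    then renormalize. (Illustrative surrogate for a proper learner of monotone
    distributions; not the paper's exact procedure.)"""
    levels = []  # list of [block mean, block width]
    for p in pmf:
        levels.append([p, 1])
        # Merge adjacent blocks while the fitted sequence would increase.
        while len(levels) > 1 and levels[-2][0] < levels[-1][0]:
            m2, w2 = levels.pop()
            m1, w1 = levels.pop()
            w = w1 + w2
            levels.append([(m1 * w1 + m2 * w2) / w, w])
    fitted = np.concatenate([[m] * w for m, w in levels])
    return fitted / fitted.sum()

def sampling_corrector(noisy_samples, n, rng, num_requests):
    """Filter: learn an empirical histogram from the noisy samples, correct it
    to a monotone pmf, and serve 'corrected' samples drawn from that pmf."""
    hist = np.bincount(noisy_samples, minlength=n).astype(float)
    corrected_pmf = monotone_projection(hist / hist.sum())
    return rng.choice(n, size=num_requests, p=corrected_pmf)

# Example: a distribution that should be monotone but has a corrupted bump.
rng = np.random.default_rng(0)
true_pmf = np.array([0.30, 0.25, 0.05, 0.20, 0.12, 0.08])  # bump at index 3
noisy = rng.choice(len(true_pmf), size=20000, p=true_pmf)
print(sampling_corrector(noisy, len(true_pmf), rng, 10))
```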

    Deterministic Distributed Algorithms and Lower Bounds in the Hybrid Model

    The HYBRID model was recently introduced by Augustine et al. [John Augustine et al., 2020] in order to characterize from an algorithmic standpoint the capabilities of networks which combine multiple communication modes. Concretely, it is assumed that the standard LOCAL model of distributed computing is enhanced with the feature of all-to-all communication, but with very limited bandwidth, captured by the node-capacitated clique (NCC). In this work we provide several new insights on the power of hybrid networks for fundamental problems in distributed algorithms. First, we present a deterministic algorithm which solves any problem on a sparse n-node graph in Õ(√n) rounds of HYBRID, where the Õ(·) notation suppresses polylogarithmic factors of n. We combine this primitive with several sparsification techniques to obtain efficient distributed algorithms for general graphs. Most notably, for the all-pairs shortest paths problem we give deterministic (1 + ε)- and (log n / log log n)-approximate algorithms for unweighted and weighted graphs respectively, with round complexity Õ(√n) in HYBRID, closely matching the performance of the state-of-the-art randomized algorithm of Kuhn and Schneider [Kuhn and Schneider, 2020]. Moreover, we make a connection with the Ghaffari-Haeupler framework of low-congestion shortcuts [Mohsen Ghaffari and Bernhard Haeupler, 2016], leading, among others, to a (1 + ε)-approximate algorithm for Min-Cut after O(polylog(n)) rounds, with high probability, even if we restrict local edges to transfer O(log n) bits per round. Finally, we prove via a reduction from the set disjointness problem that Ω̃(n^{1/3}) rounds are required to determine the radius of an unweighted graph, as well as a (3/2 - ε)-approximation of the radius for weighted graphs. As a byproduct, we show an Ω̃(n) round-complexity lower bound for computing a (4/3 - ε)-approximation of the radius in the broadcast variant of the congested clique, even for unweighted graphs.
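
    For readers unfamiliar with the model, the toy bookkeeping below illustrates the two communication modes HYBRID combines: unbounded messages across local graph edges (the LOCAL mode), plus a node-capacitated global mode in which each node may send and receive only O(log n) short messages per round (the NCC mode). The class, the parameter names, and the capacity choice are assumptions made for illustration; this is not code from the paper.

```python
import math
from collections import defaultdict

class HybridRound:
    """Toy bookkeeping for one HYBRID round: unbounded messages over local
    (graph) edges, and a node-capacitated global mode where each node may
    send and receive at most `capacity` short messages."""
    def __init__(self, n, local_edges, capacity=None):
        self.n = n
        self.local_edges = set(frozenset(e) for e in local_edges)
        # NCC-style budget: O(log n) global messages per node per round.
        self.capacity = capacity if capacity is not None else max(1, int(math.log2(n)))
        self.global_sent = defaultdict(int)
        self.global_received = defaultdict(int)
        self.inbox = defaultdict(list)

    def send_local(self, u, v, msg):
        if frozenset((u, v)) not in self.local_edges:
            raise ValueError("local mode only along graph edges")
        self.inbox[v].append(("local", u, msg))

    def send_global(self, u, v, msg):
        if self.global_sent[u] >= self.capacity or self.global_received[v] >= self.capacity:
            raise ValueError("node-capacitated clique budget exceeded this round")
        self.global_sent[u] += 1
        self.global_received[v] += 1
        self.inbox[v].append(("global", u, msg))

# Example: 8 nodes on a path; node 0 talks locally to 1 and globally to 5.
rnd = HybridRound(8, [(i, i + 1) for i in range(7)])
rnd.send_local(0, 1, "hello neighbour")
rnd.send_global(0, 5, "short global message")
print(rnd.inbox[1], rnd.inbox[5])
```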

    Efficient Statistics, in High Dimensions, from Truncated Samples

    We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy, the parameters of a multivariate normal distribution from truncated samples. Truncated samples from a d-variate normal N(μ, Σ) means that a sample is only revealed if it falls in some subset S ⊆ ℝ^d; otherwise the samples are hidden and their count in proportion to the revealed samples is also hidden. We show that the mean μ and covariance matrix Σ can be estimated with arbitrary accuracy in polynomial time, as long as we have oracle access to S and S has non-trivial measure under the unknown d-variate normal distribution. Additionally, we show that without oracle access to S, any non-trivial estimation is impossible. (To appear at the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2018.)
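
    As a rough illustration of the estimation task (not the paper's algorithm, which handles d-variate normals in polynomial time using only oracle access to S), the sketch below fits a one-dimensional truncated normal by maximizing the likelihood of the revealed samples, with a known interval [lower, upper] standing in for S. All numerical values are made up for the demo.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import truncnorm

# Hypothetical ground truth and truncation window (the "survival set" S).
true_mu, true_sigma, lower, upper = 2.0, 3.0, 0.0, 4.0
a, b = (lower - true_mu) / true_sigma, (upper - true_mu) / true_sigma
samples = truncnorm.rvs(a, b, loc=true_mu, scale=true_sigma, size=5000, random_state=1)

def neg_log_likelihood(params):
    """Negative log-likelihood of the revealed (truncated) samples."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # keep sigma positive
    a_, b_ = (lower - mu) / sigma, (upper - mu) / sigma
    return -truncnorm.logpdf(samples, a_, b_, loc=mu, scale=sigma).sum()

result = minimize(neg_log_likelihood,
                  x0=np.array([samples.mean(), np.log(samples.std())]),
                  method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"estimated mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")  # should be near 2.0 and 3.0
```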

    Secretary and Online Matching Problems with Machine Learned Advice

    The classical analysis of online algorithms, due to its worst-case nature, can be quite pessimistic when the input instance at hand is far from worst-case. Often this is not an issue with machine learning approaches, which shine in exploiting patterns in past inputs in order to predict the future. However, such predictions, although usually accurate, can be arbitrarily poor. Inspired by a recent line of work, we augment three well-known online settings with machine-learned predictions about the future, and develop algorithms that take them into account. In particular, we study the following online selection problems: (i) the classical secretary problem, (ii) online bipartite matching and (iii) the graphic matroid secretary problem. Our algorithms still come with a worst-case performance guarantee in the case that predictions are subpar, while obtaining an improved competitive ratio (over the best-known classical online algorithm for each problem) when the predictions are sufficiently accurate. For each algorithm, we establish a trade-off between the competitive ratios obtained in the two respective cases.
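
    The sketch below shows one simple way such a trade-off can arise in the classical secretary problem, assuming a prediction of the best candidate's value: trust the prediction when a candidate comes close to it, and otherwise fall back to the classical 1/e exploration rule. The rule, the confidence parameter, and the demo data are illustrative assumptions, not the algorithms analyzed in the paper.

```python
import numpy as np

def secretary_with_prediction(values, predicted_max, confidence=0.9):
    """Toy learning-augmented secretary rule (illustrative only): accept the
    first candidate whose value reaches `confidence * predicted_max`; if no
    one qualifies, use the classical 1/e strategy as a worst-case fallback."""
    n = len(values)
    threshold = confidence * predicted_max
    cutoff = int(n / np.e)          # classical exploration length
    best_seen = -np.inf             # best value among candidates seen so far
    for i, v in enumerate(values):
        if v >= threshold:
            return i                # trust the prediction
        if i >= cutoff and v > best_seen:
            return i                # classical fallback: beat everyone so far
        best_seen = max(best_seen, v)
    return n - 1                    # forced to take the last candidate

# Example with an accurate and a wildly inaccurate prediction.
rng = np.random.default_rng(2)
vals = rng.exponential(scale=1.0, size=100)
print("accurate  :", vals[secretary_with_prediction(vals, predicted_max=vals.max())])
print("inaccurate:", vals[secretary_with_prediction(vals, predicted_max=10 * vals.max())])
print("true max  :", vals.max())
```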

    Learning Augmented Online Facility Location

    Following the research agenda initiated by Munoz & Vassilvitskii [1] and Lykouris & Vassilvitskii [2] on learning-augmented online algorithms for classical online optimization problems, in this work we consider the Online Facility Location problem under this framework. In Online Facility Location (OFL), demands arrive one-by-one in a metric space and must be (irrevocably) assigned to an open facility upon arrival, without any knowledge about future demands. We present an online algorithm for OFL that exploits potentially imperfect predictions on the locations of the optimal facilities. We prove that the competitive ratio decreases smoothly from sublogarithmic in the number of demands to constant, as the error, i.e., the total distance of the predicted locations to the optimal facility locations, decreases towards zero. We complement our analysis with a matching lower bound establishing that the dependence of the algorithm's competitive ratio on the error is optimal, up to constant factors. Finally, we evaluate our algorithm on real-world data and compare our learning-augmented approach with the current best online algorithm for the problem.
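
    One natural way to use such predictions, sketched below, is to plug them into Meyerson's classical randomized rule for OFL: open a new facility with probability proportional to the arriving demand's distance to the nearest open facility, but place it at the closest predicted location rather than at the demand itself. This is an illustrative assumption about how the prediction could be used, not necessarily the algorithm analyzed in the paper.

```python
import numpy as np

def ofl_with_predictions(demands, predicted_facilities, opening_cost, rng):
    """Toy prediction-augmented variant of Meyerson's online facility location
    rule (illustrative only): on each demand, open a facility with probability
    min(dist / opening_cost, 1), placing it at the nearest predicted location;
    otherwise assign the demand to the nearest open facility."""
    open_facilities = []
    total_cost = 0.0
    for x in demands:
        if open_facilities:
            d_open = min(np.linalg.norm(x - c) for c in open_facilities)
        else:
            d_open = np.inf
        if rng.random() < min(d_open / opening_cost, 1.0):
            # Snap the new facility to the closest predicted optimal location.
            site = min(predicted_facilities, key=lambda p: np.linalg.norm(x - p))
            open_facilities.append(site)
            total_cost += opening_cost + np.linalg.norm(x - site)
        else:
            total_cost += d_open
    return total_cost, open_facilities

# Example on the line: demands cluster near 0 and 10, predictions are nearby.
rng = np.random.default_rng(3)
demands = np.concatenate([rng.normal(0, 0.5, 50), rng.normal(10, 0.5, 50)])[:, None]
preds = [np.array([0.1]), np.array([9.8])]   # hypothetical predicted facilities
cost, facs = ofl_with_predictions(demands, preds, opening_cost=5.0, rng=rng)
print(f"total cost = {cost:.1f}, facilities opened = {len(facs)}")
```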

    Sample-Optimal Identity Testing with High Probability

    We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0 < epsilon, delta < 1, we wish to distinguish, with probability at least 1-delta, whether the distributions are identical versus epsilon-far in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via black-box amplification, which multiplies the required number of samples by Theta(log(1/delta)). We show that black-box amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV}(p, U_n) >= epsilon, we simply threshold d_{TV}(hat{p}, U_n), where hat{p} is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta. An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worst-case failure probability for a broad class of uniformity testing statistics over all possible input distributions - including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well.
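
    The plug-in uniformity tester described above is short enough to state directly: form the empirical distribution hat{p} from the samples and accept if its total variation distance to U_n is below a threshold. In the sketch below the threshold is an ad hoc value chosen for the demo; the paper's analysis dictates how it should actually be set as a function of n, the number of samples, epsilon, and delta.

```python
import numpy as np

def plugin_uniformity_test(samples, n, threshold):
    """Plug-in uniformity tester: accept 'p = U_n' iff the total variation
    distance between the empirical distribution and U_n is below `threshold`.
    The threshold used here is illustrative, not the paper's setting."""
    counts = np.bincount(samples, minlength=n)
    p_hat = counts / counts.sum()
    tv = 0.5 * np.abs(p_hat - 1.0 / n).sum()
    return tv, tv <= threshold

# Demo: uniform vs. a distribution that is 0.25-far from uniform in TV distance.
rng = np.random.default_rng(4)
n, m = 1000, 20000
uniform_samples = rng.integers(0, n, size=m)
far_pmf = np.full(n, 1.0 / n)
far_pmf[: n // 2] *= 1.5
far_pmf[n // 2:] *= 0.5
far_samples = rng.choice(n, size=m, p=far_pmf)
print(plugin_uniformity_test(uniform_samples, n, threshold=0.15))  # expect accept
print(plugin_uniformity_test(far_samples, n, threshold=0.15))      # expect reject
```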